1. The combustion of fossil fuels in power plants and vehicles accounts for the major portion of air pollution. Nitrogen oxides (NOx = NO + NO2) are considered primary pollutants of the atmosphere, since they are responsible for environmental problems such as photochemical smog, acid rain, tropospheric ozone, ozone layer depletion, and, eventually, global warming.
2. An important source of harmful pollutants (NOx and CO) released into the atmosphere is the combustion process in the power industry, so there is particular concern about reducing emissions from power plants. When natural gas is used as fuel, the EU limits NOx and CO emissions to 25 ppmdv (parts per million by dry volume).
Objective: examine the correlations between CO, NOx, and the other measured quantities, determine which factors contribute most to increases in CO and NOx, and fit a predictive model.
# Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Import Dataset
df=pd.read_csv("C:\\Users\\soura\\OneDrive\\Desktop\\gt_2013.csv")
df
| | AT | AP | AH | AFDP | GTEP | TIT | TAT | TEY | CDP | CO | NOX |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9.3779 | 1020.1 | 90.262 | 2.3927 | 19.166 | 1043.6 | 541.16 | 110.16 | 10.564 | 9.3472 | 98.741 |
| 1 | 9.2985 | 1019.9 | 89.934 | 2.3732 | 19.119 | 1039.9 | 538.94 | 109.23 | 10.572 | 11.0160 | 104.290 |
| 2 | 9.1337 | 1019.8 | 89.868 | 2.3854 | 19.178 | 1041.0 | 539.47 | 109.62 | 10.543 | 10.7500 | 103.470 |
| 3 | 8.9715 | 1019.3 | 89.490 | 2.3825 | 19.180 | 1037.1 | 536.89 | 108.88 | 10.458 | 12.2870 | 108.810 |
| 4 | 9.0157 | 1019.1 | 89.099 | 2.4044 | 19.206 | 1043.5 | 541.25 | 110.09 | 10.464 | 9.8229 | 100.020 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7147 | 4.8631 | 1027.0 | 81.084 | 4.2825 | 34.045 | 1100.0 | 529.98 | 168.38 | 14.290 | 1.2538 | 78.397 |
| 7148 | 4.5173 | 1027.4 | 80.813 | 4.2481 | 33.904 | 1100.1 | 530.47 | 168.07 | 14.344 | 1.0808 | 78.251 |
| 7149 | 4.2717 | 1027.9 | 80.380 | 4.2817 | 34.165 | 1099.9 | 529.56 | 168.55 | 14.395 | 1.0472 | 77.269 |
| 7150 | 4.0853 | 1028.6 | 78.907 | 4.2313 | 33.802 | 1100.1 | 530.61 | 167.98 | 14.343 | 1.0875 | 77.985 |
| 7151 | 4.2148 | 1029.4 | 70.679 | 4.2049 | 33.768 | 1100.0 | 530.97 | 167.30 | 14.291 | 1.1337 | 78.950 |
7152 rows × 11 columns
Before doing anything else with the data, let's check whether any of the columns contain null values (missing data).
df.isnull().any()
AT      False
AP      False
AH      False
AFDP    False
GTEP    False
TIT     False
TAT     False
TEY     False
CDP     False
CO      False
NOX     False
dtype: bool
df.isnull().sum()
AT      0
AP      0
AH      0
AFDP    0
GTEP    0
TIT     0
TAT     0
TEY     0
CDP     0
CO      0
NOX     0
dtype: int64
We have no missing data, so all the entries are valid for use.
df.shape
(7152, 11)
df.columns
Index(['AT', 'AP', 'AH', 'AFDP', 'GTEP', 'TIT', 'TAT', 'TEY', 'CDP', 'CO',
'NOX'],
dtype='object')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7152 entries, 0 to 7151
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   AT      7152 non-null   float64
 1   AP      7152 non-null   float64
 2   AH      7152 non-null   float64
 3   AFDP    7152 non-null   float64
 4   GTEP    7152 non-null   float64
 5   TIT     7152 non-null   float64
 6   TAT     7152 non-null   float64
 7   TEY     7152 non-null   float64
 8   CDP     7152 non-null   float64
 9   CO      7152 non-null   float64
 10  NOX     7152 non-null   float64
dtypes: float64(11)
memory usage: 614.8 KB
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| AT | 7152.0 | 17.602620 | 6.862890 | 0.289850 | 12.048750 | 17.20450 | 23.164000 | 33.8730 |
| AP | 7152.0 | 1011.999607 | 6.290065 | 989.380000 | 1008.400000 | 1011.80000 | 1016.000000 | 1029.7000 |
| AH | 7152.0 | 80.461624 | 14.125390 | 27.504000 | 71.493500 | 84.00200 | 91.579000 | 100.1900 |
| AFDP | 7152.0 | 3.695958 | 0.805829 | 2.329500 | 3.100350 | 3.62785 | 4.156825 | 6.9779 |
| GTEP | 7152.0 | 25.105097 | 4.350711 | 18.104000 | 21.385000 | 24.85250 | 26.385750 | 36.9500 |
| TIT | 7152.0 | 1081.569463 | 17.385147 | 1022.100000 | 1065.975000 | 1087.30000 | 1094.400000 | 1100.5000 |
| TAT | 7152.0 | 545.780885 | 7.358935 | 518.320000 | 543.745000 | 549.90000 | 550.030000 | 550.5300 |
| TEY | 7152.0 | 132.168342 | 16.348156 | 101.480000 | 118.005000 | 133.57000 | 135.520000 | 172.9600 |
| CDP | 7152.0 | 11.971586 | 1.132159 | 9.875400 | 11.001250 | 11.95600 | 12.319250 | 14.8670 |
| CO | 7152.0 | 2.723031 | 2.363913 | 0.005033 | 1.257975 | 1.78270 | 3.591225 | 35.0450 |
| NOX | 7152.0 | 70.007899 | 12.048249 | 43.198000 | 62.269000 | 68.65100 | 76.001500 | 119.9100 |
Here `df.describe().T` gives the count, mean, std, min, quartiles, and max for every attribute.
# Number of unique values for every feature
df.nunique()
AT      6266
AP       483
AH      6488
AFDP    6097
GTEP    4990
TIT      633
TAT     1566
TEY     2820
CDP     2646
CO      6600
NOX     6470
dtype: int64
fig, ax = plt.subplots(4, 3, figsize=(19, 16), sharex=False, sharey=False)
sns.histplot(df['AT'], kde=True, color="red", ax=ax[0,0])
sns.histplot(df['AP'], kde=True, ax=ax[0,1])
sns.histplot(df['AH'], kde=True, color="green", ax=ax[0,2])
sns.histplot(df['AFDP'], kde=True, color="orange", ax=ax[1,0])
sns.histplot(df['GTEP'], kde=True, color="purple", ax=ax[1,1])
sns.histplot(df['TIT'], kde=True, color="grey", ax=ax[1,2])
sns.histplot(df['TAT'], kde=True, color="blue", ax=ax[2,0])
sns.histplot(df['TEY'], kde=True, color="brown", ax=ax[2,1])
sns.histplot(df['CDP'], kde=True, color="black", ax=ax[2,2])
sns.histplot(df['CO'], kde=True, color="blue", ax=ax[3,0])
sns.histplot(df['NOX'], kde=True, color="red", ax=ax[3,1])
ax[3,2].set_axis_off()  # hide the unused twelfth panel
fig.tight_layout()
Some of the features are approximately normally distributed. The features AH, CO, TIT, and TAT exhibit the highest skew coefficients. Moreover, the distributions of carbon monoxide (CO), turbine inlet temperature (TIT), and turbine after temperature (TAT) appear to contain many outliers.
fig, ax=plt.subplots(nrows=4,ncols=3,figsize=(15,6))
df.plot(y='AT',ax=ax[0,0],color='red')
df.plot(y='AP',ax=ax[0,1])
df.plot(y='AH',ax=ax[0,2],color='green')
df.plot(y='AFDP',ax=ax[1,0],color='orange')
df.plot(y='GTEP',ax=ax[1,1],color='purple')
df.plot(y='TIT',ax=ax[1,2],color='grey')
df.plot(y='TAT',ax=ax[2,0],color='blue')
df.plot(y='TEY',ax=ax[2,1],color='brown')
df.plot(y='CDP',ax=ax[2,2],color='black')
df.plot(y='CO',ax=ax[3,0],color='blue')
df.plot(y='NOX',ax=ax[3,1],color='red')
fig.tight_layout(pad=1)
This is a simple plot of each attribute (AT, AP, AH, ...) against the observation index.
sns.pairplot(df)
#check for outliers
ot=df.copy()
fig, axes=plt.subplots(11,1,figsize=(14,16),sharex=False,sharey=False)
sns.boxplot(x='AT',data=ot,palette='crest',ax=axes[0])
sns.boxplot(x='AP',data=ot,palette='crest',ax=axes[1])
sns.boxplot(x='AH',data=ot,palette='crest',ax=axes[2])
sns.boxplot(x='AFDP',data=ot,palette='crest',ax=axes[3])
sns.boxplot(x='GTEP',data=ot,palette='crest',ax=axes[4])
sns.boxplot(x='TIT',data=ot,palette='crest',ax=axes[5])
sns.boxplot(x='TAT',data=ot,palette='crest',ax=axes[6])
sns.boxplot(x='TEY',data=ot,palette='crest',ax=axes[7])
sns.boxplot(x='CDP',data=ot,palette='crest',ax=axes[8])
sns.boxplot(x='CO',data=ot,palette='crest',ax=axes[9])
sns.boxplot(x='NOX',data=ot,palette='crest',ax=axes[10])
plt.tight_layout(pad=2.0)
numerical_features = df.describe(include=["int64","float64"]).columns
numerical_features
Index(['AT', 'AP', 'AH', 'AFDP', 'GTEP', 'TIT', 'TAT', 'TEY', 'CDP', 'CO',
'NOX'],
dtype='object')
#outlier
plt.figure(figsize=(20,16))
sns.boxplot(data=df[numerical_features], orient="h")
Pearson's correlation coefficient measures the strength of the linear association between two quantities. Its value lies between -1 and +1: values near ±1 indicate a strong linear relationship (positive or negative), and 0 indicates no linear correlation.
A heat map is a two-dimensional representation of information using color. Heat maps help the user visualize simple or complex information at a glance.
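The definition above can be checked directly: Pearson's r is the covariance of the two variables divided by the product of their standard deviations, and `DataFrame.corr()` computes exactly this. A minimal sketch on made-up numbers (not the turbine data):

```python
import numpy as np
import pandas as pd

# Toy two-column frame; the values are purely illustrative.
demo = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0],
                     "y": [2.1, 3.9, 6.2, 8.0, 9.8]})

# Pearson's r from its definition: covariance over the product of std devs.
x, y = demo["x"], demo["y"]
r_manual = ((x - x.mean()) * (y - y.mean())).sum() / (
    np.sqrt(((x - x.mean()) ** 2).sum()) * np.sqrt(((y - y.mean()) ** 2).sum())
)

# pandas computes the same quantity via DataFrame.corr().
r_pandas = demo.corr().loc["x", "y"]
```

Both routes give the same value, and because the toy `y` grows almost linearly with `x`, r comes out close to +1.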
# Scatter Plot to Target Variable
sns.pairplot(df, x_vars=['AT','AP','AH','AFDP','GTEP','TIT','TAT','CDP','CO','NOX'],kind='scatter',y_vars='TEY',height=3,aspect=0.6)
plt.show()
x=df.mean()
fig1,ax1 = plt.subplots()
ax1.pie(x, labels=df.columns, autopct='%0.0f%%',
        shadow=True, startangle=90)
plt.show()
corr = pd.DataFrame(data = df.corr().iloc[:,9], index=df.columns)
corr = corr.sort_values(by='CO', ascending=False)
corr
| | CO |
|---|---|
| CO | 1.000000 |
| NOX | 0.366217 |
| AH | 0.247851 |
| TAT | 0.155655 |
| AP | -0.109782 |
| AT | -0.157783 |
| AFDP | -0.479581 |
| GTEP | -0.642176 |
| CDP | -0.655751 |
| TEY | -0.668985 |
| TIT | -0.806942 |
fig= plt.figure(figsize=(18, 10))
sns.heatmap(df.corr(), annot=True);
plt.xticks(rotation=45)
plt.title("Correlation Map of variables", fontsize=19)
plt.title("Correlation plot between Target variables and independent variables", y=1.01, fontsize=18)
sns.barplot(x = corr.index, y = corr.CO)
sns.set(rc={'figure.figsize':(20,10)})
A correlation coefficient measures the strength of the relationship between two variables. The most commonly used is the Pearson coefficient, which ranges from -1.0 to +1.0. A positive correlation indicates two variables that tend to move in the same direction; a negative correlation indicates two variables that tend to move in opposite directions. A coefficient whose magnitude is 0.8 or greater indicates a strong relationship, while a magnitude below 0.3 indicates a weak one.
In the correlation table above we can observe a strong negative correlation between CO and the turbine parameters (TIT, TEY, CDP, and GTEP).
NOX correlations:
corr = pd.DataFrame(data = df.corr().iloc[:,10], index=df.columns)
corr = corr.sort_values(by='NOX', ascending=False)
corr
| | NOX |
|---|---|
| NOX | 1.000000 |
| CO | 0.366217 |
| AH | 0.182527 |
| AP | 0.096800 |
| TEY | 0.040766 |
| CDP | -0.005352 |
| GTEP | -0.024444 |
| TIT | -0.122998 |
| TAT | -0.179357 |
| AFDP | -0.386677 |
| AT | -0.581687 |
plt.title("Correlation plot between Target variables and independent variables", y=1.01, fontsize=18)
sns.barplot(x = corr.index, y = corr.NOX)
sns.set(rc={'figure.figsize':(20,10)})
In the case of NOX, there is a strong negative correlation with AT and a mild negative correlation with AFDP.
TEY correlations:
corr = pd.DataFrame(data = df.corr().iloc[:,7], index=df.columns)
corr = corr.sort_values(by='TEY', ascending=False)
corr
| | TEY |
|---|---|
| TEY | 1.000000 |
| CDP | 0.990425 |
| GTEP | 0.981677 |
| TIT | 0.917162 |
| AFDP | 0.540439 |
| AP | 0.226761 |
| NOX | 0.040766 |
| AH | -0.115436 |
| AT | -0.165419 |
| CO | -0.668985 |
| TAT | -0.722024 |
plt.title("Correlation plot between Target variables and independent variables", y=1.01, fontsize=18)
sns.barplot(x = corr.index, y = corr.TEY)
sns.set(rc={'figure.figsize':(20,10)})
TEY has strong positive correlations with CDP, GTEP, and TIT, while CO and TAT are strongly negatively correlated with it.
The figure above reveals very strong linear dependencies among the input variables, particularly between compressor discharge pressure (CDP) and turbine energy yield (TEY) (0.99), and similarly between CDP and gas turbine exhaust pressure (GTEP) (0.98). GTEP also has a very strong correlation with TEY (0.98). This shows that some of the features may contain redundant information and can be eliminated during model learning. Moreover, the five turbine parameters (GTEP, CDP, AFDP, TIT, and TAT) have stronger correlations with TEY than the three ambient variables (AT, AP, and AH) used as features.
For CO, the five strongest correlates are the turbine parameters GTEP, CDP, AFDP, TIT, and TEY, so we will restrict the CO model to these five features.
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
#Standardize & Normalize the data
norm = MinMaxScaler()
std = StandardScaler()
df_norm = pd.DataFrame(norm.fit_transform(df), columns=df.columns)  # each feature rescaled to [0, 1]
df_std = pd.DataFrame(std.fit_transform(df), columns=df.columns)    # each feature centred to mean 0, std 1
df_norm
| | AT | AP | AH | AFDP | GTEP | TIT | TAT | TEY | CDP | CO | NOX |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.270613 | 0.761905 | 0.863412 | 0.013596 | 0.056351 | 0.274235 | 0.709097 | 0.121433 | 0.137952 | 0.266615 | 0.724046 |
| 1 | 0.268249 | 0.756944 | 0.858900 | 0.009401 | 0.053858 | 0.227041 | 0.640174 | 0.108422 | 0.139554 | 0.314240 | 0.796381 |
| 2 | 0.263342 | 0.754464 | 0.857992 | 0.012026 | 0.056988 | 0.241071 | 0.656628 | 0.113878 | 0.133745 | 0.306649 | 0.785692 |
| 3 | 0.258512 | 0.742063 | 0.852791 | 0.011402 | 0.057094 | 0.191327 | 0.576529 | 0.103525 | 0.116716 | 0.350513 | 0.855303 |
| 4 | 0.259828 | 0.737103 | 0.847412 | 0.016113 | 0.058474 | 0.272959 | 0.711891 | 0.120453 | 0.117918 | 0.280191 | 0.740719 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7147 | 0.136177 | 0.933036 | 0.737143 | 0.420145 | 0.845856 | 0.993622 | 0.361999 | 0.935926 | 0.884406 | 0.035638 | 0.458846 |
| 7148 | 0.125880 | 0.942956 | 0.733415 | 0.412744 | 0.838374 | 0.994898 | 0.377212 | 0.931589 | 0.895224 | 0.030701 | 0.456943 |
| 7149 | 0.118567 | 0.955357 | 0.727458 | 0.419972 | 0.852223 | 0.992347 | 0.348960 | 0.938304 | 0.905441 | 0.029742 | 0.444142 |
| 7150 | 0.113016 | 0.972718 | 0.707193 | 0.409130 | 0.832962 | 0.994898 | 0.381559 | 0.930330 | 0.895024 | 0.030892 | 0.453475 |
| 7151 | 0.116873 | 0.992560 | 0.593993 | 0.403451 | 0.831158 | 0.993622 | 0.392735 | 0.920817 | 0.884606 | 0.032211 | 0.466055 |
7152 rows × 11 columns
df_std
| | AT | AP | AH | AFDP | GTEP | TIT | TAT | TEY | CDP | CO | NOX |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.198518 | 1.287898 | 0.693861 | -1.617401 | -1.365182 | -2.184170 | -0.627972 | -1.346322 | -1.243364 | 2.802401 | 2.385003 |
| 1 | -1.210088 | 1.256099 | 0.670639 | -1.641601 | -1.375986 | -2.397010 | -0.929668 | -1.403213 | -1.236297 | 3.508398 | 2.845600 |
| 2 | -1.234103 | 1.240200 | 0.665966 | -1.626460 | -1.362424 | -2.333734 | -0.857641 | -1.379355 | -1.261913 | 3.395865 | 2.777536 |
| 3 | -1.257739 | 1.160704 | 0.639204 | -1.630059 | -1.361964 | -2.558079 | -1.208260 | -1.424624 | -1.336997 | 4.046104 | 3.220785 |
| 4 | -1.251298 | 1.128906 | 0.611521 | -1.602880 | -1.355987 | -2.189923 | -0.615742 | -1.350604 | -1.331697 | 3.003649 | 2.491167 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7147 | -1.856421 | 2.384942 | 0.044064 | 0.727925 | 2.054958 | 1.060205 | -2.147320 | 2.215185 | 2.047925 | -0.621568 | 0.696341 |
| 7148 | -1.906811 | 2.448539 | 0.024877 | 0.685233 | 2.022547 | 1.065958 | -2.080730 | 2.196221 | 2.095625 | -0.694757 | 0.684222 |
| 7149 | -1.942600 | 2.528035 | -0.005779 | 0.726932 | 2.082542 | 1.054453 | -2.204398 | 2.225584 | 2.140675 | -0.708972 | 0.602711 |
| 7150 | -1.969763 | 2.639329 | -0.110067 | 0.664384 | 1.999101 | 1.065958 | -2.061704 | 2.190716 | 2.094741 | -0.691923 | 0.662143 |
| 7151 | -1.950892 | 2.766523 | -0.692604 | 0.631620 | 1.991286 | 1.060205 | -2.012781 | 2.149118 | 2.048808 | -0.672377 | 0.742243 |
7152 rows × 11 columns
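Before moving on, it may help to see concretely what the two scalers do; a toy sketch with made-up numbers (not the dataset's values):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One illustrative column; scalers expect a 2-D array.
col = np.array([[10.0], [20.0], [30.0], [40.0]])

# MinMaxScaler maps the column onto [0, 1]: (x - min) / (max - min).
mm = MinMaxScaler().fit_transform(col)

# StandardScaler centres to mean 0 and unit variance: (x - mean) / std.
st = StandardScaler().fit_transform(col)
```

After `MinMaxScaler` the smallest value becomes 0 and the largest 1; after `StandardScaler` the column has mean 0 and standard deviation 1.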
df1 = df_norm.drop(['AT', 'AP', 'AH', 'NOX'], axis=1)
df1
| | AFDP | GTEP | TIT | TAT | TEY | CDP | CO |
|---|---|---|---|---|---|---|---|
| 0 | 0.013596 | 0.056351 | 0.274235 | 0.709097 | 0.121433 | 0.137952 | 0.266615 |
| 1 | 0.009401 | 0.053858 | 0.227041 | 0.640174 | 0.108422 | 0.139554 | 0.314240 |
| 2 | 0.012026 | 0.056988 | 0.241071 | 0.656628 | 0.113878 | 0.133745 | 0.306649 |
| 3 | 0.011402 | 0.057094 | 0.191327 | 0.576529 | 0.103525 | 0.116716 | 0.350513 |
| 4 | 0.016113 | 0.058474 | 0.272959 | 0.711891 | 0.120453 | 0.117918 | 0.280191 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 7147 | 0.420145 | 0.845856 | 0.993622 | 0.361999 | 0.935926 | 0.884406 | 0.035638 |
| 7148 | 0.412744 | 0.838374 | 0.994898 | 0.377212 | 0.931589 | 0.895224 | 0.030701 |
| 7149 | 0.419972 | 0.852223 | 0.992347 | 0.348960 | 0.938304 | 0.905441 | 0.029742 |
| 7150 | 0.409130 | 0.832962 | 0.994898 | 0.381559 | 0.930330 | 0.895024 | 0.030892 |
| 7151 | 0.403451 | 0.831158 | 0.993622 | 0.392735 | 0.920817 | 0.884606 | 0.032211 |
7152 rows × 7 columns
Multiple linear regression estimates the relationship between two or more independent variables and one dependent variable.
It can quantify how strongly the independent variables are related to the dependent variable (e.g., how ambient temperature (AT), ambient pressure (AP), ambient humidity (AH), air filter difference pressure (AFDP), gas turbine exhaust pressure (GTEP), turbine inlet temperature (TIT), turbine after temperature (TAT), compressor discharge pressure (CDP), carbon monoxide (CO), and nitrogen oxides (NOx) relate to turbine energy yield (TEY)). It can also predict the value of the dependent variable at given values of the independent variables (e.g., the expected TEY at particular levels of those same inputs).
df1.shape
(7152, 7)
Multiple linear regression formula
y = B0 + B1X1 + ... + BnXn

where:
- y = the predicted value of the dependent variable
- B0 = the intercept (the value of y when all predictors are set to 0)
- B1X1 = the regression coefficient (B1) of the first independent variable (X1), i.e., the effect that increasing X1 has on the predicted y
- ... = likewise for however many independent variables you are testing
- BnXn = the regression coefficient of the last independent variable
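The formula above can be verified numerically: the vector [B0, B1, ..., Bn] is the solution of the normal equations, and it matches what sklearn's LinearRegression fits. A minimal sketch on synthetic data (the true coefficients 0.5, 1.0, -2.0, 0.3 are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # three toy predictors
y = 0.5 + X @ np.array([1.0, -2.0, 0.3]) + rng.normal(scale=0.01, size=100)

# Normal-equation solution for [B0, B1, ..., Bn]:
Xb = np.column_stack([np.ones(len(X)), X])          # prepend an intercept column
beta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# sklearn fits the same model.
model = LinearRegression().fit(X, y)
```

`beta[0]` agrees with `model.intercept_` and `beta[1:]` with `model.coef_` up to floating-point precision.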
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split
x=df_norm[list(df1)[0:6]]
y=df_norm[list(df1)[6]]
x #independent variable
| | AFDP | GTEP | TIT | TAT | TEY | CDP |
|---|---|---|---|---|---|---|
| 0 | 0.013596 | 0.056351 | 0.274235 | 0.709097 | 0.121433 | 0.137952 |
| 1 | 0.009401 | 0.053858 | 0.227041 | 0.640174 | 0.108422 | 0.139554 |
| 2 | 0.012026 | 0.056988 | 0.241071 | 0.656628 | 0.113878 | 0.133745 |
| 3 | 0.011402 | 0.057094 | 0.191327 | 0.576529 | 0.103525 | 0.116716 |
| 4 | 0.016113 | 0.058474 | 0.272959 | 0.711891 | 0.120453 | 0.117918 |
| ... | ... | ... | ... | ... | ... | ... |
| 7147 | 0.420145 | 0.845856 | 0.993622 | 0.361999 | 0.935926 | 0.884406 |
| 7148 | 0.412744 | 0.838374 | 0.994898 | 0.377212 | 0.931589 | 0.895224 |
| 7149 | 0.419972 | 0.852223 | 0.992347 | 0.348960 | 0.938304 | 0.905441 |
| 7150 | 0.409130 | 0.832962 | 0.994898 | 0.381559 | 0.930330 | 0.895024 |
| 7151 | 0.403451 | 0.831158 | 0.993622 | 0.392735 | 0.920817 | 0.884606 |
7152 rows × 6 columns
y #dependent variable
0 0.266615
1 0.314240
2 0.306649
3 0.350513
4 0.280191
...
7147 0.035638
7148 0.030701
7149 0.029742
7150 0.030892
7151 0.032211
Name: CO, Length: 7152, dtype: float64
We have divided the data into "attributes" and "labels".
Attributes are the independent variables, while labels are the dependent variables whose values are to be predicted.
# Training datasets
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2, random_state=42)
The code above splits 80% of the data into the training set and 20% into the test set; the test_size argument specifies the proportion held out for testing. Fixing random_state makes the split reproducible, so repeated runs of the model train and test on the same rows.
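The effect of random_state can be demonstrated on toy arrays (these are assumptions for illustration, not the notebook's data): calling train_test_split twice with the same seed yields identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20).reshape(10, 2)   # 10 toy samples, 2 features
target = np.arange(10)

# Same random_state -> identical split on every call.
Xa, _, ya, _ = train_test_split(data, target, test_size=0.2, random_state=42)
Xb, _, yb, _ = train_test_split(data, target, test_size=0.2, random_state=42)
```

With test_size=0.2 on 10 samples, 8 rows land in the training set, and both calls select the same 8.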
# Fitting the linear model with coefficients
model = LinearRegression()
model.fit(x_train,y_train) #training the algorithm
LinearRegression()
After splitting the data into training and testing sets, we train the algorithm: import the LinearRegression class, instantiate it, and call its fit() method with the training data.
print(model.intercept_)
print(list(zip(x, model.coef_)))
0.4192591719540044
[('AFDP', 0.018745316940409817), ('GTEP', -0.4782889880632093), ('TIT', -0.028349857615098478), ('TAT', -0.23465416460003213), ('TEY', -0.13099118461752557), ('CDP', 0.25888636443720564)]
As discussed, linear regression finds the values of the intercept and the slopes that best fit the data. The output above shows the intercept and the coefficient calculated for each feature in our dataset.
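Operationally, predict() is just the intercept plus the dot product of a row's features with the coefficients. A tiny self-contained sketch (made-up numbers, not the notebook's fitted coefficients):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy fit on four samples with two features.
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0]])
y = np.array([5.0, 4.0, 7.5, 11.0])
model = LinearRegression().fit(X, y)

# model.predict applies y_hat = intercept_ + X @ coef_ row by row.
row = np.array([[2.5, 1.0]])
manual = model.intercept_ + row @ model.coef_
```

The manually computed value equals `model.predict(row)` exactly.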
# Making prediction model
y_pred_model = model.predict(x_test)
y_pred_model
array([-0.00247846, 0.05038516, 0.07420109, ..., 0.05029164,
0.15129535, 0.15040129])
# New dataframe to plot regression
dataFrame_Pred_y = pd.DataFrame(y_pred_model, columns = ['CO'])
dataFrame_Pred_y.head()
| | CO |
|---|---|
| 0 | -0.002478 |
| 1 | 0.050385 |
| 2 | 0.074201 |
| 3 | 0.237299 |
| 4 | 0.032906 |
# Regression plots of CO against each predictor
sns.pairplot(df1, x_vars=list(x_test.columns), y_vars=['CO'], height=3, aspect=0.8, kind="reg");
# Plotting scatter plot for Predicted CO vs Actual CO
plt.scatter(y_test, y_pred_model)
plt.xlabel("CO: $Y_i$")
plt.ylabel(r"Predicted CO: $\hat{Y}_i$")
plt.title(r"CO vs Predicted CO: $Y_i$ vs $\hat{Y}_i$")
plt.show()
ActualandPredicted = pd.DataFrame({'Actual': y_test.values.flatten(), 'Predicted': y_pred_model.flatten()})
ActualandPredicted.head()
| | Actual | Predicted |
|---|---|---|
| 0 | 0.031389 | -0.002478 |
| 1 | 0.207496 | 0.050385 |
| 2 | 0.068969 | 0.074201 |
| 3 | 0.315211 | 0.237299 |
| 4 | 0.037793 | 0.032906 |
print(model.score(x_test, y_test))
0.7288118479015173
The model's R² score on the test set is about 0.73.
meanAbErr = metrics.mean_absolute_error(y_test, y_pred_model)
r2_score = metrics.r2_score(y_test, y_pred_model)
mean_squared_error = metrics.mean_squared_error(y_test, y_pred_model)
print('Mean Absolute Error:', meanAbErr) # Mean Absolute Error
print(f'R2 Score: {r2_score}') # R Squared value
print(f'Mean Squared Error : {mean_squared_error}') # Mean Squared Error
Mean Absolute Error: 0.02172786397484268 R2 Score: 0.7288118479015173 Mean Squared Error : 0.0010997277216344022
The mean absolute error represents the average of the absolute differences between the actual and predicted values in the dataset. It measures the typical magnitude of the residuals.
R-squared (R2) is a statistical measure that represents the proportion of the variance for a dependent variable that's explained by an independent variable or variables in a regression model.
Mean Squared Error represents the average of the squared difference between the original and predicted values in the data set. It measures the variance of the residuals.
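These three definitions can be verified by hand; a small sketch with made-up actual/predicted values, checked against sklearn.metrics:

```python
import numpy as np
from sklearn import metrics

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # illustrative actual values
y_hat  = np.array([2.5,  0.0, 2.0, 8.0])   # illustrative predictions

resid = y_true - y_hat
mae = np.abs(resid).mean()                 # mean absolute error
mse = (resid ** 2).mean()                  # mean squared error
ss_res = (resid ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot                   # coefficient of determination
```

Each hand-computed value matches the corresponding sklearn.metrics function.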
For NOX, the correlation matrix shows the strongest correlation (a strong negative one) with ambient temperature, which suggests that operating at higher ambient temperatures helps reduce this exhaust emission.
Features retained for the NOX model: AT, AFDP, AH, TEY, AP.
df2 = df_norm.drop(['GTEP', 'TIT', 'TAT', 'CDP','CO'], axis=1)
df2
| | AT | AP | AH | AFDP | TEY | NOX |
|---|---|---|---|---|---|---|
| 0 | 0.270613 | 0.761905 | 0.863412 | 0.013596 | 0.121433 | 0.724046 |
| 1 | 0.268249 | 0.756944 | 0.858900 | 0.009401 | 0.108422 | 0.796381 |
| 2 | 0.263342 | 0.754464 | 0.857992 | 0.012026 | 0.113878 | 0.785692 |
| 3 | 0.258512 | 0.742063 | 0.852791 | 0.011402 | 0.103525 | 0.855303 |
| 4 | 0.259828 | 0.737103 | 0.847412 | 0.016113 | 0.120453 | 0.740719 |
| ... | ... | ... | ... | ... | ... | ... |
| 7147 | 0.136177 | 0.933036 | 0.737143 | 0.420145 | 0.935926 | 0.458846 |
| 7148 | 0.125880 | 0.942956 | 0.733415 | 0.412744 | 0.931589 | 0.456943 |
| 7149 | 0.118567 | 0.955357 | 0.727458 | 0.419972 | 0.938304 | 0.444142 |
| 7150 | 0.113016 | 0.972718 | 0.707193 | 0.409130 | 0.930330 | 0.453475 |
| 7151 | 0.116873 | 0.992560 | 0.593993 | 0.403451 | 0.920817 | 0.466055 |
7152 rows × 6 columns
x1=df_norm[list(df2)[0:5]]
y1=df_norm[list(df2)[5]]
x1
| | AT | AP | AH | AFDP | TEY |
|---|---|---|---|---|---|
| 0 | 0.270613 | 0.761905 | 0.863412 | 0.013596 | 0.121433 |
| 1 | 0.268249 | 0.756944 | 0.858900 | 0.009401 | 0.108422 |
| 2 | 0.263342 | 0.754464 | 0.857992 | 0.012026 | 0.113878 |
| 3 | 0.258512 | 0.742063 | 0.852791 | 0.011402 | 0.103525 |
| 4 | 0.259828 | 0.737103 | 0.847412 | 0.016113 | 0.120453 |
| ... | ... | ... | ... | ... | ... |
| 7147 | 0.136177 | 0.933036 | 0.737143 | 0.420145 | 0.935926 |
| 7148 | 0.125880 | 0.942956 | 0.733415 | 0.412744 | 0.931589 |
| 7149 | 0.118567 | 0.955357 | 0.727458 | 0.419972 | 0.938304 |
| 7150 | 0.113016 | 0.972718 | 0.707193 | 0.409130 | 0.930330 |
| 7151 | 0.116873 | 0.992560 | 0.593993 | 0.403451 | 0.920817 |
7152 rows × 5 columns
y1
0 0.724046
1 0.796381
2 0.785692
3 0.855303
4 0.740719
...
7147 0.458846
7148 0.456943
7149 0.444142
7150 0.453475
7151 0.466055
Name: NOX, Length: 7152, dtype: float64
# Training datasets
x1_train, x1_test, y1_train, y1_test = train_test_split(x1,y1, test_size=0.2, random_state=42)
# Fitting the linear model with coefficients
model = LinearRegression()
model.fit(x1_train,y1_train) #training the algorithm
LinearRegression()
# Making prediction model
y1_pred_model = model.predict(x1_test)
y1_pred_model
array([0.32899022, 0.25068608, 0.32839916, ..., 0.28276362, 0.24990338,
0.21735277])
# New dataframe to plot regression
dataFrame_Pred_y1 = pd.DataFrame(y1_pred_model, columns = ['NOX'])
dataFrame_Pred_y1.head()
| | NOX |
|---|---|
| 0 | 0.328990 |
| 1 | 0.250686 |
| 2 | 0.328399 |
| 3 | 0.501372 |
| 4 | 0.307008 |
# Regression plots of NOX against each predictor
sns.pairplot(df2, x_vars=list(x1_test.columns), y_vars=['NOX'], height=3, aspect=0.8, kind="reg");
# Plotting scatter plot for Predicted NOX vs Actual NOX
plt.scatter(y1_test, y1_pred_model)
plt.xlabel("NOX: $Y_i$")
plt.ylabel(r"Predicted NOX: $\hat{Y}_i$")
plt.title(r"NOX vs Predicted NOX: $Y_i$ vs $\hat{Y}_i$")
plt.show()
As you can see, there is much more scatter between predicted and actual NOX than in the CO case, which already suggests lower accuracy even before computing the metrics.
ActualandPredicted1 = pd.DataFrame({'Actual': y1_test.values.flatten(), 'Predicted': y1_pred_model.flatten()})
ActualandPredicted1.head()
| | Actual | Predicted |
|---|---|---|
| 0 | 0.505775 | 0.328990 |
| 1 | 0.224945 | 0.250686 |
| 2 | 0.291310 | 0.328399 |
| 3 | 0.751799 | 0.501372 |
| 4 | 0.372849 | 0.307008 |
print(model.score(x1_test, y1_test))
0.4161093791983814
Hence the accuracy of the NOX model using linear regression is not as good as that of the CO model.
meanAbErr = metrics.mean_absolute_error(y1_test, y1_pred_model)
r2_score = metrics.r2_score(y1_test, y1_pred_model)
mean_squared_error = metrics.mean_squared_error(y1_test, y1_pred_model)
print('Mean Absolute Error:', meanAbErr) # Mean Absolute Error
print(f'R2 Score: {r2_score}') # R Squared value
print(f'Mean Squared Error : {mean_squared_error}') # Mean Squared Error
Mean Absolute Error: 0.08888184390036134 R2 Score: 0.4161093791983814 Mean Squared Error : 0.014148070742767999
R² on the test set: CO 0.73, NOX 0.42.
The linear model therefore predicts CO much better than NOX. We can try other algorithms, such as Random Forest, SVM, or Decision Tree, to get a better result.
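As a sketch of one such alternative, here is a RandomForestRegressor fitted on synthetic data (an assumed stand-in, not the turbine dataset); the fit/score workflow used above carries over unchanged:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic non-linear regression problem, purely illustrative.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + 0.1 * rng.normal(size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
rf = RandomForestRegressor(n_estimators=200, random_state=42).fit(X_tr, y_tr)
score = rf.score(X_te, y_te)   # R^2 on the held-out data
```

Because random forests average many decision trees, they can capture non-linear relationships like the one above that a single linear model cannot, which is why they may help for NOX.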
Reference: H. Kaya, P. Tüfekci, and E. Uzun, "Predicting CO and NOx emissions from gas turbines: novel data and a benchmark PEMS," Turkish Journal of Electrical Engineering & Computer Sciences. Department of Computer Engineering, Çorlu Faculty of Engineering, Namık Kemal University, Tekirdağ, Turkey.